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Conditional Akaike information nnder covariate shift with 
application to small area prediction 

Yuki Kawakubo* Shonosuke Sugasawa^ and Tatsuya Kubokawa^ 


Abstract 


In this study, we consider the problem of selecting explanatory variables of fixed effects 
in linear mixed models under covariate shift, which is when the values of covariates in the 
predictive model differ from those in the observed model. We constru ct a variable selection 
criteri on based on the conditional Akaike information introduced by IVaida and Blanchard 
(200^. We focus especially on covariate shift in small area prediction and demonstrate 
the usefulness of the proposed criterion. In addition, numerical performance is investigated 
through simulations, one of which is a design-based simulation using a real dataset of land 
prices. 


Keywords: Akaike information criterion; Conditional AIC; Covariate shift; Linear mixed 
model; Small area estimation; Variable selection. 


1 Introduction 


Linear mixed models have been studied for a long time theoretically, and have many applications, 
for example, longitudinal data analysis in biostatistics, panel data analysis in econometrics, and 
small area estimation in official statistics. The problem of selecting explanator y variables in 
l inear mixed models is important and has been investigated in the literature. iMiiller et al. 
( 2 OI 3 I I is a good survey on model selection in linear mixed models. 

When the purpose of the variable selecti on is to find a set o f significant variables for a good 
prediction, Akaike-type information criteria ( Akaike . 1973 . 1974l l are well-known methods. How¬ 
ever, the Akaike information criterion (AIC) based on marginal likelihood, which integrates out 
likelihood with respect to random effects, is not ap propriate when the prediction is focused on 
random effects. Then, Vaida and BlanchardI ( 20051 ) proposed considering Akaike-type informa¬ 
tion based on the conditional density given the random effects and proposed the conditional AIC 
(cAIC). Here, we provide a brief explanation of the cAIC concept. Let f{y\b, r/) be a conditional 
density function of y given b, where y is an observable random vector of the response variables, 
r/ is a vector of the unknown parameters, and b is a random vector o f the random effects. Let 
'7r(&|r7) be a density function of b. Then, Vaida and Blanchard ( 2005l l proposed measuring the 
prediction risk of the plug-in predictive density f{y\b, rj) relative to Kullback-Leibler divergence: 


log 


f{y\b^'n) 


f{y\b,r))dy 


/(y|6,?7)7r(f>|?7)dr/db, 
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where y is an independent replication of y given b, and b and rj are some predictor or estimator 
of b and rj, respectively. The cAIC is an (asymptotically) unbiased estimator of a part of the 
risk in ([ll), which is called the conditional Akaike information (cAI), given by 


cAI = -2 


log {fiy\b, r))} f{y\b, rj)f{y\b, r])7r{b\rj)dydydb. 


he c A IC as a variable selection c riterion in linear mixed models ha s been studied bvlLiang et al . _ 

(2008), Greven and KneibI ( 2O10l l. Srivastava and Kubokawa ( 2O10l ). Kuboka\^ ( 201ll ). lKawakubo and Kubokawa 


(j2014l ). and others. Furthermore, the cAI C has been con s tructed as a variable selec t ion crite¬ 


rion in generalized linear mi xed models by iDonohue et al.l (j201l|) , lYu and Yaul (j2012l l , I Yu et al 
( 2 OI 3 I I. Saefken et al. ( 2014 1. and others. 


Considering the prediction problem, it is often the case that the values of covariates in the 
predictive model are different from those in the observed model, which we call covariate shift. 
Here, we call the model in which y is the vector of the response variables the “observed model,” 
and we call the model in which y is the vector of the respon se variables the “pr edictive model.” 
It is noted that the term “covariate shift” was first used by Shimodaira ( 2000l l. who defined it 
as the situation in which the distribution of the covariates in the predictive model differs from 
that in the observed model. In this s tudy, although we treat the covariates as non-random, we 


use the same term “covariate shift” as Shimodaira ( 2000l ). Even when the information about the 


covariates in the predictive model can be used, most of the Akaike-type information criteria do 
not use it. This is because it is assumed that the predictive model is the same as the observed 
model for deriving the criteria. As for the abovementioned cAIC, the conditional density of y 
given b and that of y given b are the same, both of which are denoted by f{-\b, rj). On the other 
hand, under the covariate shift, the conditional density of y given b is different from that of y 
given b and is denoted by g{y\b, rj). When the aim of the variable selection is to choose the best 
predictive model, it is not appropriate to use the covariates only in the observed model. Then, 
we redefine the cAI under covariate shift, as follows. 


cAI = -2 


log {g(i/|6, r))} giy\b, r])f{y\b, r])TT{b\r])dydydb, 


and c o nstruct an information criterion as an asymptotically unbiased estimator of the cAI. 
Sat oh ( 1997 . 20001 ) considered a similar problem in the multivariate linear regression model 
and proposed a variable selection criterion. It is important to note that we do not assume 
that the candidate model is overspecified, in other words, that the candidate model includes 
the true model. Although most of the Akaike-type information criteria make the overspecified 
assumption, this is not appropriate for estimating the cAI under covariate shift. We discuss this 
point in Section 

As an important applicable example of covariate shift, we focus on small area prediction, 
which is based on a finite super-population model. We consider the situation in which we are 
interested in the finite subpopulation (area) mean of some characteristic and that some values 
in the subpopulation are observed through some sampling procedure. When the sample size in 
each area is small, the problem is ca ll ed small area estimation . For details about s mall area es¬ 
timation, see B,ao and Molina ( 20151 ). Datta and Chosh ( 20121 ). Pfeffermann ( 201,41 ). and others. 
The model-based approach in small area estimation often assumes that the finite population has 
a super-population with random effects and borrows strength from other areas to estimate (pre¬ 
dict) the small area (finite subpopulation) mean. The well-known unit-level model is the nested 
error regression mod el (NERM), which is a kind of linear mixed model, and was introduced by 
Battese et al. ( 1988l l. The NERM can be used when the values of the auxiliary variables for 
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the units with characteristics of interest (response variables in the model) are observed through 
survey sampling. This is the observed model in the framework of our variable selection proce¬ 
dure. On the other hand, two types of predictive model can be considered. One is the unit-level 
model, which can be used in the situation in which the values of the auxiliary variables are 
known for all units. The other is the area-level model, which can be used in the situation in 
which each mean of the auxiliary variables is known fo r each small area . The latter is often the 
case in official statistics and the model introduced by Fav and Herriot ( 1979l l is often used in 
this case. 

The rest of this paper is organized as follows. In Section [21 we explain the setup of variable 
selection problem. In Section [3l we define the cAI under covariate shift in linear mixed models 
and obtain an asymptotically unbiased estimator of the cAI. In Section 01 we provide an example 
of covariate shift, which is focused on small area prediction. In Section [5l we investigate the 
numerical performance of the proposed criteria by simulations, one of which is design-based 
simulation based on a real dataset of land prices. All the proofs are given in the Appendix. 


2 Setup of variable selection problem 

2.1 Class of candidate models 


We focus on the variable selection of the fixed effects. First, we consider the collection of 
candidate models as follows. Let n x pi^ matrix X{uj) consist of all the explanatory variables 
and assume that rank(X(t(j)) = In order to define candidate models by the index set j, 
suppose that j denotes a subset of a; = {1 ,... ,puj} containing pj elements, i.e., pj = #(j), and 
X{j) consists of pj columns of X{oj) indexed by the elements of j. We define the class of the 
candidate models hy J = V{uj), namely, the power set of to, in which we call uj the full model. 
We assume that the true model exists in the class of the candidate models J, which is denoted 
by j-f. It is noteworthy that the dimension of the true model is pj ^, which is abbreviated to p*. 

We next introduce the terms “overspecified” and “underspecified” models. Candidate model 
j is overspecified if X{uj)l3^ € 'R.\X{j)], wh ich means that X(uj)0^ is in the c olumn space 
of X(j ) following iFujikoshi and Satohl (Il997l ) or iKawakubo and Kubokawal (|2ni4l ) . The set of 
overspecihed models are denoted by J'+ = {j G J\j* C j}. On the other hand, candidate 
model j is underspecified when X{uj) % TZ\X{j)]. The set of underspecified models is denoted 
by J- = J \ J7+. It is important to note that most of the Akaike-type information criteria 
are derived under the assumption that the candidate model is overspecified. However, the 
assumption is not appropriate for considering the covariate shift, which is explained in Section 
[321 Thus, we derive the criterion without the overspecified assumption. 

In the following two subsections, we clarify the observed model and predictive model for 
deriving the criteria. 


2.2 Observed model 

The candidate observed model j is the linear mixed model 

y = X{j)/3j -|- Zbj Sj, (2) 

where y is an n x 1 observation vector of response variables, X{j) and Z are n x pj and n x q 
matrixes of covariates, respectively, (3j is a pj x 1 vector of regression coefficients, bj is a q x 1 
vector of random effects, and Sj is an n x 1 vector of random errors. Let bj and Sj be mutually 
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independent and bj ~ Mq{0, cr'jG), £j ~ cr'^R), where G and R are qxq and nxn positive 

definite matrixes and cj| is a scalar. We assume that G and R are known and a'j is unknown. 
The true observed model j* is 

y = -Xi(cu)/3^ + Zb^: + 

where b* ~ J\fq{0,alG), £* ~ Mn{0,alR) and (3^ is x 1 vector of regression coefficients, 
whose Pij — p* components are exactly 0 and the rest of the components are not 0. The fact 
that we can write the true model as the equation above implies that the true model exists in the 
class of candidate models J. Note that X (w) is n x matrix of covariates for the full model 
w. Then, the marginal distribution of y is 

where X) = ZGZ^ + R. For the true model, the conditional density function of y given b* and 
the density function of b* are denoted by /(?/|b*,/3^, cr^) and 7r(b*|cr^), respectively. 

2.3 Predictive model 

The candidate predictive model j is the linear mixed model, which has the same regression 
coefficients /3j and random effects bj as the candidate observed model j, but different covariates, 
namely, 

y = X{j)(3j + Zbj + £j, (3) 

where y \s m xl random vector of the target of prediction, X{j) and Z are m x pj and m x q 
matrixes of covariates whose columns correspond to those of X(j) and Z, respectively, and 
£j is m X 1 vector of random errors, which is independent of bj and £j and is distributed as 
£j ~ Mm{^,(^jR), where i? is a known m x m positive definite matrix. We assume that we 

know the values of X[j) and Z in the predictive model and that they are not necessarily the 
same as those of X{j) and Z in the observed model. We call this situation covariate shift. The 
conditional density function of y given bj for the model j is denoted by gj{y\bj, /3j,a‘j). 

The true predictive model j* is 

y = X{uj)(3^, + Zb^: + £*, 

where X{uj) is m x p^j matrix of covariates and e* ~ A/’m(0, alR). Then, the marginal distribu¬ 
tion of y is 

where X) = ZGZ + R. For the true model, the conditional density function of y given b* is 
denoted by g{y\b^, a^). 


3 Proposed criteria 


3.1 Conditional Akaike information under covariate shift 


The cAI introduced bv IVaida and BlanchardI (j2005l i measures the prediction risk of the plug-in 
predictive density gj{y\bj, Pj,a‘^), where Pj and dl are maximum likelihood estimators of Pj 


and cr|, respectively, which are given as 


3 , = {X{jfT.-^X{j))-^X{jfT.-^y, 

= (y- X(j)PjriJ-\y - X(j)P^)/n, 
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respectively, and bj is the empirical Bayes estimator of bj for quadratic loss, which is given by 

b,=GZ^'S-\y-Xij)P^). 


Then, the cAI under covariate shift is 


cAI = 


-2log{gj{y\bj,l3.,&‘^)} 


-1 , 


mlog{2Traj) + log \R\ + {y - X(j)/3 • - ZbjYR {y - X{j))(3j - Zbj)la 


where and denote expectation with respect to the joint distribution of {y,b^) ~ 

/(y|6*, /3*, (T^)7r(6*|(T*) and the conditional distribution of y given 6*, namely ~ /3*, cr^), 

respectively. Taking expectation with respect to y| 6* ~ g{y\b*-, and6*|y ^ Mq{bt,,a‘l{G— 

GZ^'E-^ZG)) for 6* = GZ^^-^{y - X(a;)/3J, we obtain 


where 


cAI = E 


-1 


mlog(27r(T^) + log \R\ + tr {R A) ■ a^/a^ + a 


^R a/a 


A = S - ZGZ^S-^ZGZ'^, 

a = (X(j)3^. - X(a;)/3J - ZGZ^'S~\Xij)P^ - X(a;)/3J. 


(4) 

(5) 


3.2 Drawback of overspecified model assumption 

Most of the Akaike-type information criteria are derived under the assumption that the candidate 
model includes the true model, namely, the overspecified assumption. Although the assumption 
seems to be too strong, the influence is restrictive in practice. This is because the likelihood 
part of the criterion is a naive estimator of the risk function, namely, the cAI in the context of 
the cAIC. 

Under the covariate shift situation, however, we cannot construct the likelihood part as a 
good estimator of the cAI. That is, the drawback of overspecified assumption is more serious in 
the situation of covariate shift than the usual one. In Section EH we show that an unbiased 
estimator of the cAI under the overspecified assumption cAU in (|22l) has large bias for estimating 
the cAI of the underspecified models. 

Thus, we evaluate and estimate the cAI directly both for the overspecified models and 
underspecified models in the following subsection, which is essential work in selecting variables 
in covariate shift. 


3.3 Evaluation and estimation of cAI 

We evaluate the cAI in Q both for the overspecified model and for the underspecified model. 
We assume that the full model uj is overspecified, that is, the collection of the overspecified 
models is not an empty set. In addition, we assume that the size of the response variable in 
the predictive model m is of order 0{n). 

When the candidate model j is overspecified, na^^jal follows the chi-squared distribution. 
Then, we can evaluate the expectation in Q exactly. However, for the underspecified model, 
this is not true. In this case, we asymptotically approximate the cAI as the following theorem. 
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Theorem 1 For the overspecified case, it follows that cAI = i?[mlog( 27 r(T|)] + log |-R| + R*, 
where 

R* = — 

n — pj — 2 

for 7 = tr + tr [R~^A{X{jy'E-^X{j))-^A^] and A = X{j) - ZGZ^'E-^X{j). For 

the underspecified case, cAI is approximated as 

cAI = E[mlog{ 27 ra‘j)] + log \R\ + R* + Ri + R2 + R3 + + 0 {n~^), ( 6 ) 

where 


Ri = 7(A -1), 

i?2 = 7 • 7T- ^{—2A^ + {pj + 4)A^ — {pj + 2)}, 

R3 = X-PIB^R~" BP Jal 


R 4 = n-H-2A3 + (p, + 4)A2} X PfB^R ^Bpjal, 

for A = 1/(1 + b), 

b = PlX{wY{P^ - PfiX{u:)PJ{nal), 

B = PjX{uj) - X(a;) + ZGZ^{P^ - Pj)X{io), 
P, = ^-^x{j){xijy's-^xij))-^xijy's-\ 

P^ = ^-^X{io){X{ujfi:^^X{io))-^X{iof'S-\ 


and 

P, = X(j)(X(jf s-i. 

When the candidate model j is overspecified, it follows that Ri, R 2 , R 3 , and R^ are exactly 0. 

Because the approximation of cAI in ([6]) includes unknown parameters, we have to provide 
an estimator of cAI for practical use. First, we obtain estimators of Ri and R 2 , which are 
polynomials of A. We define A, A^, and A^ as 

~ _ n-pjdl 
^ 2 ’ 

n-puj cr| 

— ^ {n-pj){n-pj + 2) ( 

(n - Pu){n - + 2) I d] y 


and 

^ ^ (n -pj)(n -pj + 2){n-pj +4) 
{n - Pui){n - puj + 2){n-p^ + 4) 


When j € J7+, it follows that 



-fif ~ Be 


n-puj 


Puj - Pj \ 

2 J’ 


where Be(-, •) denotes the beta distribution. This implies that F(\) = F(X^) = E{X^) = 1 for the 
overspecified case. For the underspecified case, on the other hand, it follows that E[{&‘^/&j)^] = 
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+ 0{n as n —>■ oo for k = 1,2, 3. Then, we obtain an estimator of R 2 in the approximation 
of cAI given by Q, which is given as follows: 

+ (7) 

n 

Because i?i is of order 0{n), we have to estimate A with higher-order accuracy in order to 
obtain an estimator of Ri whose bias is of order 0{n~^) for the underspecihed case. To this 
end, we expand E{X) up to 0{n~^) as 


eCx) 


n-p^ \Kq + Ki) 


^ _l_ —2A^ -b {pj -|- 2)A^ 
n 


^ + 0(n-2). 


Then, we obtain an estimator of Ri given as 




( 8 ) 


We now have the followin g lemm a, which can be proved using Appendix C and Appendix D 
of Kawakubo and Kubokawa ( 2014l i. 


Lemma 1 When the candidate model j is underspecified, Ri in Q and R 2 in dZD are asymp¬ 
totically unbiased estimators of Ri and R 2 , respectively, whose bias is of order 0{n~^), that 
is, 

E{^i) = Ri + 0{n-^) and E{^2) = R2 + 0{n-^). 

When the candidate model j is overspecified, it follows that E{Ri) = E{R 2 ) = 0. 

We next consider estimation procedures of iia and i?4, which are complex functions of un¬ 
known parameters. We observe R^ and R4 as functions of = (/ 3 *,cr^)^, that is, R^ = RsiiJ^), 
Ri = Rfirj^) and substitute their unbiased estimators rj = {(3 which are given by 

3 = 3. = (X(cuf 

= {y- X{u})]3f'E-^{y - X {uj)]3) / (n - p^). 


Then, the plug-in estimators of i?3 and R 4 are 

'R3 = R3iv) = X- 

% = R^{y) = n-i{-2p + (pj + 4)A2} X W B'^Ef^Bp/a'^, (9) 

where A = 1/(1 -|- (5) for 

5 = P^X{ojf{P^ - Pj)X{uj)^/{nd^). 

Because R 3 is of order 0(n), the plug-in estimator R 3 has second-order bias. Then, we correct 
the bias by analytical method based on Taylor series expansions. We observe that expectation 
of the plug-in estimator Rsirj) is expanded as 

ElR^m = Rsiv.) + + B 2 {v,) + 0(n-2), (10) 
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where Bi{r]^) is second-order bias and B 2 {r]^) is third-order bias of that is, Bi{ri^) = 0(1) 

and -02(^7*) = 0(n“^), respectively. Because j3 and are independent, it follows that 





df3^df3l 




2 {dalf 


E[{a^ 


J 5 


( 11 ) 


where E[{P - /3J{P - = a2(X(a;)'^S ^X{uj)) ^ and £'[(d^ - = 2{a‘^)‘^/{n - Puj). 

Second-order partial derivatives of i ?3 are given by the following lemma. 


Lemma 2 The second-order partial derivative of R^{ri^) with respect to [3^ is 

d^Rsiv,) f3lB^R~^B/3, f 2C 8Cp,pfC ] 

dp^d(3f 0-2 ^ \ 710-2(1-^5)2 712(0-2)2(1 + 5)3 j 

AB^R~^Bp^(3fC + AC/3^PfB^R~^B , 2 

- f 2 -, 2 n^r ,2 - + 2X B^R B al 

n{aiy{l + 5)2 

where C = X{ijjY— Pj)X{u}). The second-order partial derivative of Ryriy with respect 
to al is 


d'^R^iv*) 

(5(t2)2 


--1. 


fSfB^R Bp, 


O': 


X < - 


2 PICP, 


+ 


2 {Plcp,f 


7i(o-2)3(1 + 5)2 7i2(o-2)4(1 + 5)3 


'-1 


2PIB^R BP,PICP, 

71 ( 0 - 2 ) 4(1 + 5)2 


-1 . 


+ 2\-PlB^R BpJ{a., 


2\3 


When the candidate model f is overspecified, second-order bias Bi{rj,) can be simplified to 

Bi{rj,) = tr [B^R~^B{X{u;fi:-^X{oj))-^], 

because CP, = BP, = 0 and A = 1, which implies that {d‘^Rz{'n*))/{dP,dp'l) = 2B^R ^Bja^ 
and that { 8 “^Ryri,))j{da‘iy = 0. However, we cannot know which candidate models are over- 
specified. Then, we propose the following bias-corrected estimator of R^-. 


R3 = Rsiv) - Biiv)- 


( 12 ) 


Lemma 3 Both for the cases in which the candidate model j is overspecified and j is under- 

specified, R 3 and Ri in (fT^ and ([U]) are asymptotically unbiased estimators of R 3 and R^, whose 
bias is of order 0 {n~^), that is, 

EiRs) = R 3 + 0 {n-^), and .E()^) = i?4 + C>(n"^). 

Using i?i, R 2 , R 3 , and R 4 given by (0), m, d, and 0, respectively, we construct an 
estimator of cAI as follows: 

cAI = 77ilog(27ro-|) -|- log |i?| + i?* -|- i?! + i ?2 + .R 3 + Ra- (13) 


Then, we obtain the following theorem, which is shown by Theorem [T] and Lemmas [TH3l 
















Theorem 2 Both for the cases in which the candidate model j is overspecified and j is under- 
specified, cAI in (USD is a second-order asymptotically unbiased estimator o/cAI, that is, 

^(c5j) =cAI+ 0(n“^). 


When the sample size n is small, second-order accuracy seems not to be sufficient for the 
overspecified model. Actually, as the result in the simulation study shows, the estimate of the cAI 
of the true model has relatively large bias, although the estimation of the tr ue model is important. 
Moreo ver, some of the other information criteria, which include the cAIC of IVaida and Blanchard 
(l2nn,^ i. are exact unbiased estimators of the information of the overspecified candidate model. 
Thus, we should improve the estimators of Rs and R 4 to remove the bias that is of order 0{n~^). 
To this end, we adopt a parametric bootstrap method. 

Bootstrap sample is generated by 

= X{u;)P + Zb^ + 

where and are generated by the following distribution: 

fof ~ AA(0, and ~ W(0, 


Then, we use the following estimator of R 4 : 

i?4 = ‘^R^iv) - Ej^[R4^(ri^)] (14) 

where Ejj denotes expectation with respect to the bootstrap distribution and is = 

((3V,d2t)Tfor 

_ X{u})p^fY:-^iy'^ - X{u})P^)/{n - p,_,). 

As for R 3 , it follows from (jinj] that 

Ej^lRsiy^)] = Rsiv) + Bi{y) + B2{y) + Op{n-^). 

However, Biify) has a bias with order 0{n~^), that is, 

E[B^{r,)] = Bfiy,) + Bi^{y,) + 0{n-^), 

where Hii(77,„) = 0{n~^). Because this bias is not negligible when we want to estimate R^ with 
third-order accuracy, we estimate the bias by bootstrap method as follows: 

^i=E^[Bi{y^)]-Bi{y). 

Then, we obtain an estimator of i? 3 , which is given as 

= “^Rsiv) — + Eli 

= 2R^{y) - E^[R^{y^)] + E^[Bi{y^)] - B^ir,). (15) 


Lemma 4 Both for the cases in which the candidate model j is overspecified and j is underspec¬ 
ified, i ?3 and 424 in ([E]) and (fTTl) are asymptotically unbiased estimators of R 3 and R^, whose 
bias is of order 0 {n~‘^), that is, 

Ei'Rs) = R 3 + 0 {n-^), and E{^ 4 ) = Ri + 0 {n-‘^). 
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Using Ri, i? 2 , R 3 , and R 4 given by m, dZD, disi), and m, we obtain an estimator of cAI 
as follows: 

cAI^ = mlog{27raj) + log \R\ + R* + Ri + R 2 + R 3 + Ra, (16) 

which improves cAI in unbiasedness. Then, we obtain the following theorem, which is proved 
by Theorem [2] and Lemma [H 

Theorem 3 When the candidate model j is overspecified, cAI^ in (USD is a third-order asymp¬ 
totically unbiased estimator of cAl, that is, 

=cAI+ 0(n-2). 

When the candidate model j is underspecified, cAI^ is a second-order asymptotically unbiased 
estimator of cAI, that is, 

L;(cAi^) =cAI+ 0(n-i). 


4 Application to small area prediction 


A typical example of the covariate shift situation appears in the small area prediction problem. 
The model for small area prediction supposes that the observed small area data have a finite 
population, which has the sup er-population rnodel with random effects, one of which is the 
well-known NERM proposed by Battese et al. ( 1988I L 

Let Yik and xnfij) denote the value of a characteristic of interest and its pj-dimensional 
auxiliary variable for the kth unit of the ith area for i = 1,... ,q and k = 1,... ,Ni. Note that 
Xik{j) is a subvector of Xik{co), which is the vector of the explanatory variables in the full model 
Lv, and we hereafter abbreviate the model index j and write xn^ instead of Xik{j), p instead of 
Pj, etc. Then, the NERM is 


^ik = + bi + eik {i = l,...,q; k = l,..., Ni), 


(17) 


where /3 is a p x 1 vector of regression coefficients, bi is a random effect for the ith area, 
and biS and £ikS are mutually independently distributed as bi ~ AA(0,r^) and £ik ~ 
respectively. We consider the situation in which only Uj values of the l)fcS are observed through 
some sampling procedure. We define the number of unobserved variables in the ith. area by 
Ni — Ui = ri and let n = ni n^, r = ri r^. Suppose, without loss of generality, 

the first Ui elements of {Tji,..., are observed, which are denoted by yi,..., and 

Yi^m-ei^ ■ ■ ■ , Yi^]\f^ are unobserved. Then, the observed model is defined as 


Vik = xJkP + hi + £ik (i = 1,..., g; A: = 1,..., n*). 


(18) 




which corresponds to ([2]) with y = {yf,..., y'^Y for = {yn ,..., yi^mV, X = (X^, 
for Xi = {xii,... ,Xi^niV, Z = diag(Zi,...,Zq) for Zj = 1„., G = iflq and R = In-, where 
1„. denotes an n, x 1 vector of Is and . In the derivation of our proposed criteria, we 

assume that the covariance matrix of h is a^G for a known matrix G. However, in the NERM, 
G includes the parameter which is usually unknown and has to be estimated. In this case, 
we propose that G in the bias correction should be replaced with its plug-in estimator G{fi). 
The influence caused by the replacement may be limited bec ause ib is the nuisance paramete r 
when we are interested in selecting only explanatory variables. iKawakubo and Kubokawal (120141 ) 
discussed the problem in their Remark 3.1. 
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We consider two types of predictive models. The first can be used in the situation in which all 
ajjfcS are available. Then, the predictive model, which we call the “unit-level predictive model,” 
is defined by 


Yik = xJkP + bi + eik (i = 1,..., g; A; = n* + 1,..., W), (19) 

whiA corresponds to ([3]) with y = (yj,..., for y^ = ..., X = (Xi,...,Xg)T 

for Xi = ■■■, Z = diag (Zi,, Zg) for Zi = 1^, R = Ir Note that m = r. 

In the problem of small area prediction, we often encounter the situation in which all Xij-s are 
not observed but the area mean Xi = N~^ is known and we are interested in predicting 

Yi, which is the mean of finite population {Yu,... ,Yi^]y.}, by using the value of Sj. Then, the 
second type of predictive model, which we call the “area-level predictive model,” is defined as 


y iiu) = + bi + Ei^u) (* = 1,---,^), (20) 

where mean of unobserved variables, = r~^ 

calculated from Xi and {xn,..., Xim), and Ylk=n +i is distributed as ^(0, jr^). 

The model (1201) corresponds to ([3]) with y = (Y ..., Tg(u))'^, X = («i(„),..., Xg(^u)V, Z = Ig 
and R = diag {Ri ,..., Rg) for Ri = l/r*. Note that m = q. 

After selecting explanatory variables with our proposed criteria, we predict Tj(„) by the 

empirical best linear unbiased predictor yj(„) = + bi and obtain a predictor of the finite 

population mean Yi, which is given as 


rii 


y^ -- ' ^ ^ Vik T '^iyi(u) 

^ k=l 


Ni 


( 21 ) 


Thus, covariate shift appears in standard models for small area prediction and the proposed 
criterion is important and useful in such a situation. 


5 Simulations 

5.1 Measuring the bias of estimating the true cAI by the criteria 

In this subsection, we compare the performance of the criteria by measuring the bias of estimating 
the cAI. We consider a class of the nested candidate models ja = {1,...,a} for a = 1,... ,pu) 
where p^j = 7. The true observed model is the NERM in (|18p with = 1 and n* = = 3 

for i = 1,... ,q. We consider the unit-level predictive model ()19p for the hrst experiment and the 
area-level predictive model (1201) for the second experiment. The explanatory variables in the full 
model Xik{uj)’s {i = 1 ,... ,q] k = 1,..., Ni) are independently generated by ^(0, H®), where 
Sa; = 0.9/p„-F0.1Jp„ for Jp„ = lp,,lp„- The true coefficient vector (3^ is = (/3i,... ,/3p,,0,0) 
for p* = 5 and /3;s (1 < / < p*) are generated by /?; = 2 x ((—l)^/(/-1-0.7)) x 1/(1, 2) for a uniform 
random variable C/(l,2) on the interval (1,2). The values of the explanatory variables Xi^s and 
the vector of regression coefficients (3^ are fixed through simulations. 

For comparison, we consider the exact unbiased estimator, which is derived under the as¬ 
sumption that the candidate model is overspecihed, given by 

STr = mlog(27rd|) + log |i^| + (y - X{j)^^ - Zbj)'^R-\y - X{j)pj - Zbj)/&] + Acs, (22) 
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where the bias correction term Acs is 


Acs 


- - -|tr[i? ^A]+tr[i^ 

n — pj — 2 I ) 

n pj 


Table 1: Relative bias of estimating cAI by cAR, cAI and cAI for unit-level predictive model 


True value 

Relative bias 


Model 

cAI 

cAR 

cAI 

cAJ^ 


Pattern (a): q = 

10 


ji 

206.38 

-33.439 

-0.33371 

-0.080972 

h 

152.81 

-18.840 

-0.2414 

-0.16414 

k 

140.79 

-18.556 

-0.23026 

-0.46789 

k 

132.61 

-11.451 

0.26445 

-0.19514 

k 

116.51 

-0.0019291 

1.5979 

0.49514 

k 

122.46 

-0.050686 

0.77756 

0.15429 

k 

128.88 

0.0086256 

0.09468 

0.09468 


Pattern (b): q = 

15 


ji 

233.35 

-0.22534 

0.11255 

0.1159 

k 

189.54 

9.4310 

0.24145 

0.21667 

k 

177.28 

14.197 

0.42246 

0.37347 

k 

163.62 

-0.76563 

0.32597 

0.02934 

k 

152.94 

0.13627 

0.8566 

0.25115 

k 

156.73 

0.068668 

0.53817 

0.15002 

k 

161.65 

0.083897 

0.015869 

0.015869 


Pattern (c): q = 

20 


k 

299.12 

4.3775 

0.084838 

0.084654 

k 

252.60 

6.1677 

0.24072 

0.23581 

k 

250.27 

-5.1825 

-0.016634 

0.010911 

k 

208.25 

2.6115 

0.36013 

0.23929 

k 

197.52 

0.25682 

0.53321 

0.26101 

k 

200.38 

0.24977 

0.40855 

0.25647 

k 

203.57 

0.22713 

0.21272 

0.21272 


Tables [T] and [2] report the true values of the cAI and relative bias of estimating the cAI by 

the criteria cAI in (jl3ll . cAI in m, and cAR in (I22p . for the experiment using the unit-level 
predictive model and for that using the area-level predictive model, respectively. We handle 
the cases in which the number of the areas q = 10, 15, 20. The true values of cAI in each 
candidate model are calculated based on Q with 10,000 Monte Carlo iterations. The relative 
bias of estimating the cAI by the criteria is dehned as 


100 X 


E[IC] - cAI 
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Table 2: Relative bias of estimating cAI by cAR, 


cAI and cAI 


t 


for area-level predictive model 


True value 


Relative bias 


Model 

cAI 

cAR 

(5AI 



Pattern (a): q 

= 10 


ji 

61.095 

-10.429 

-0.27157 

-0.21359 

h 

46.635 

5.4222 

0.40292 

-0.055653 

h 

50.744 

-10.285 

0.36735 

-0.23857 

k 

47.323 

-0.77484 

1.3412 

0.36198 

jb 

45.735 

-0.21108 

2.2751 

0.60073 

k 

49.452 

-0.32466 

0.82334 

0.017969 

h 

52.805 

-0.29278 

-0.082743 

-0.082743 


Pattern (b): q 

= 15 


ji 

95.393 

-5.4716 

0.12331 

0.13930 

h 

70.056 

17.521 

0.39905 

0.36599 

h 

66.412 

21.039 

0.59611 

0.53872 

k 

61.310 

5.1740 

0.58453 

0.10853 

k 

60.532 

0.23723 

1.0853 

0.25515 

k 

63.109 

0.13276 

0.50306 

0.096935 

k 

65.379 

0.16648 

-0.0017143 

-0.0017143 


Pattern (c): q 

= 20 


ji 

98.841 

21.017 

0.18161 

0.17565 

h 

94.83 

10.059 

0.31607 

0.30894 

k 

88.346 

5.6464 

0.13835 

0.11302 

k 

78.383 

7.8734 

0.46942 

0.27593 

k 

77.794 

0.18930 

0.54202 

0.17009 

k 

79.277 

0.18908 

0.41365 

0.18534 

37 

81.216 

0.15395 

0.11783 

0.11783 
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where IC = cAI, cAI^, cAI^, and expectation is computed based on 1,000 replications. The 
bootstrap sample size is 1,000 for obtaining cAI^. From the tables, we observe the following 
facts. First, cAI„ has large bias for underspecified models, that is, ji, j 2 , js, and jh, while 
the modihed estimators of the cAI, cAI and cAl\ have smaller bias for both overspecified and 
underspecified models. Second, cAI can estimate the cAI more unbiasedly than cAI can for the 
case of small sample size because cAI^ is a third-order asymptotically unbiased estimator of the 
cAI. In particular, the improvement is remarkable for the true model js, which is important for 
variable selection. However, the relative bias of cAI, which is the second-order asymptotically 
unbiased estimator of the cAI, becomes smaller as the sample size becomes larger and the 

difference in performance between cAI and cAI is not very signihcant. 


5.2 Predicting finite popnlation mean 

In this subsection, we investigate the numerical performance of the small area prediction problem 
explained in Section [H We conduct design-based simulation based on a real dataset. We use the 
posted land price data along the Keikyu train line, which connects th e suburbs in Kanagawa pre¬ 


fectur e to the Tokyo metropolitan area. This dataset was also used bv iKawakubo and Kubokawa 
( 2014l f. who studied modihcation of the cAIC. 


We analyze the land price data in 2001 with covariates for 47 stations that we consider as 
small areas, and let q = 47. In the original sample, there are rij sampled land spots for the i 
area, and the total sample size is n = J2i=i = 1^9- We generate a synthetic population of size 
A^ = 1,000 by resampling with replacement from the original dataset using selection probabilities 
inversely pro portional to sa r nple w eights. This method of making a synthetic population was 
also used by Chandra et al. ( 2012l f. Then, we select 200 independent random samples, each of 
size n = 189, from the fixed synthetic population by sampling from each area based on simple 
random sampling without replacement and with sample size of each area equal to that of original 
dataset Uj. 

The characteristic of interest is the land price (Yen in hundreds of thousands) per m^ of the 
kih. spot in the fth area, denoted by Pjfc, and the target is the mean of the land price in each 
area Pi = A^“^ for f = 1,..., g, where W is the size of the fth area (subpopulation). 

As discussed in Section [U we adopt model-based estimation of hnite subpopulation mean Pi by 
using NERM. For selecting the explanatory variables in NERM, we use our propos ed criterion 
cAI in (jl3l) for comparison with the conventional cAIC bv IVaida and BlanchardI (120051 1. However, 
because the land price data are right-skewed, we undertake log-transformation, namely, Yik = 
log(Pjfc), and fit with NERM in (|17p . 

The dataset includes the following auxiliary variables. FARj^ denotes the floor-area ratio 
of spot k in the ith area, TRNj is the time it takes by train from station i to Tokyo station 
around 9:00 AM, DSTj^ is the geographical distance from spot k to nearby station i and FOOTj*, 
denotes the time it takes on foot from spot k to nearby station i. As the candidate explanatory 
variables, we consider 7 variables FARjfc, TRNj, TRN?, DSTj^, DST^^i,, EOOTjfc, and EOOT?^, 
which are denoted by xi,... ,X 7 , and xq = 1 denotes a constant term. Using the criteria, we 
select the best combination of these variables. ^ 

Based on the best model selected by the criteria, we obtain a predictor Pi of the finite 
subpopulation mean of the land price in the fth area. However, log-transformed variable 
is used in the NERM, and thus, we have to modify the predictor (1211) . The best predictor of 
out-of-sample Pik (i = 1,... , g; k = rii + 1,... ,Ni) is the conditional expectation given the 
data y, namely, Pik{y,0) = E[Pik \ y] = E[ex.];>{Yik) \ y], where 6 = (/3'^,r^,a^)'^. Because the 
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conditional mean and variance of Yik given y, denoted by Hik and Vi, are 


xJk'f3), + 2 I „.^ 2 ’ 

* k’=l * 

it follows that 

Pik{y,9) = ex.p{nik + Vi/2), 

Substituting 6 with 0, we obtain the empirical best predictor (EBP): 

= P,kiy,e), 

where 0 is some estimator of 6. Then, we use the following predictor of Py. 


P^ = 


1 

'Ni 


Ni 


^Pik+ 

k=l k=ni-\-l 


pEBP 

-^ik 


As 6, we use unbiased estimators of and proposed by Prasad and R,aol ( IQQfll ) and the GLS 
estimator of (3. 

We measure the performance of this design-based simulation by mean squared error (MSE) 
of the predictor, 

^ 200 ^ 


200 


t=i 


—(t) —P 

where P) and P are ith finite subpopulation mean and its predictor of the tth sampled data. 
We construct cAIs using the unit-level predictive model (fTOll and the area-level predictive model 
(j20l] . and let MSE)^ and MSEf denote the corresponding MSEs. To compare the performance, 
we compute the ratio of MSEs as follows: 


MSE" , MSEf 

-r, and -j- 

MSE)'^ MSE7^ 


where MSE)''’ is the MSE of the predictor based on the cAIC of Vaida and Blanchard ( 2005l V 
Eigured] shows the results. Alt hough the performance of c AI based on the unit-level predictive 
model is similar to the cAIC of IVaida and BlanchardI (j2005l V cAI based on the area-level predic¬ 
tive model has much better performances in most areas. It is valuable to point out that the MSE 
of the predictor of the finite subpopulation mean can be improved using our proposed criteria, 
which motivates us to use them for variable selection in the small area prediction problem. 
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Figure 1: Ratios of MSE based on cAIC of Vaida and Blanchard ( 2005l i to cAI based on area- 
level predictive model (solid line) and to cAI based on unit-level predictive model (dashed line) 
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A Appendix 


A.l Proof of Theorem [T] 

Because and (t| are mutually independent, the cAI in dH) can be rewritten as 


cAl = E[m\og{2'Ka‘^j)]+\og\R\+n ■ E'\(na‘j/al) ^]{tr(i? A) + E[a^R a/cr^]}, 

For the overspecified case, we can easily evaluate the cAI, noting that ndj follows chi-squared 
distribution with n — pj degrees of freedom and is independent of Pj. Thus, it suffices to show 
that the cAI is ev aluated as (l6l) for the unde r specifi ed case. 

From (B.4) in iKawakubo and Kubokawal (|2014l l. we can evaluate E[{naj /as follows: 


'-1 




A f ^ —2A^ -I- {Pj + 4)A 

n \ n 


+ 0(n-3). 


(23) 


We next evaluate E[a'^R ci/a‘1]. Let u = y — X{u})P^,. Then, we can rewrite X{j)Pj — X{ijj)P^ 
in a as 


X(j)3, - X{u)P, = Pj{X{u)P, + u)- Xico)P, 

= {PjX{oj) - X{oo))P^ + 

Next, we can rewrite X{j)Pj — X{lo)P^ in a as 

X{j)p^ - X{u)p, = a,E^/\MpW^P,+v) - W^p,} 

= - MPW^P, + M,v} 

= - E{P^ - Pj)X{oj)P, + 


Then, we obtain 

a = Bp^ + aPPjE^/'^ - 

Moreover, it follows that 

_ ZGZ^E-^^^Mj = (X{j) - ZGZ^E-^X{j)){X{jyE-^X{j))-^X{jyE-^/^ 
= A{X{jrE-^X{j))-^X{jrE-^/^ 


■-1 


Thus, E[a^R a/ al] can be evaluated as 

E[a'^R~\/al] = tr [R~^ A{X{jfE-^X{j)r^ A-^] + plBPR^Bpjal. 


(24) 


It follows from 


and 


that 


n-E\{KQ^KxY^\{\.Y{R ^A)^E\oPR ^a/al]} 


= (7 + PIB^R 'BpjaD X <1 A + 


which shows that the cAI is approximated to Q. 


□ 
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A.2 Proof of Lemma [2] 

First, note that 

where A = 1/(1 + ( 5 ) for 

5 = - P,)Xico)f 3 J{nal). 


Then, we observe that 


dRsiv*) ^ ^ 

dr]^ 86 drj^ 

and that dX/d 6 = —(1 + After some calculations, we obtain Lemma[ 2 l 


□ 


A.3 Proof of Lemma [3] 

First, note that E[R3{^)] is expanded as 

ElRsiv)] = Rsiv.) + BM + O(n-i), 

where Bi{r]^) is given as (fTTI) . Because Bi{r]^) = 0 ( 1 ), it follows that Bi{^) = Bi{r]^) + 0 {n~^), 
which shows that ___ 

E[R3] = R3 + Oin-^). 

In the same way, we obtain ElR^] = E[Ri{ri)] = Ri{ri^,) + 0 {n~^). □ 


A.4 Proof of Lemma [4] 

It follows from (I 15 p that 

ElRs] = E [ 2 Rs{v) - Ej^lRsiv^)] + E 7 ,[B,{^^)] - 

Because E[R3{^)] is expanded as E[R3{^)] = R3{r]^) + Bi{rj^) + -62(77*) + 0 (n“^), we observe 
that 


E 


263(^7) - ^fi[^3(^7^)]] = 2 {63(77*) + 61(77*) + 62(77*)} - E [63(^7) + 61 (?7) + 62(^7)] + 0 (n- 2 ) 

= 63(77*) + 61(77*) + 62(77*) - E [61 (^) + 62(^)] + 0 (n- 2 ). 


Moreover, because 6(61(77)] = 61(77*) + 611(77*) + 0 {n ^) and 6(62(77)] = 62(77*) + 0 {n 
the equation above can be rewritten as 


6 


263(^7) - E^[R 3 { 7 }^)]\ = 63(77*) - 611(77*) + 0(77-2). 


( 25 ) 


Next, it is observed that 


6 


65^(61 (^t)] _ + Bniv)] - {61(77*) + 611(77*)} + 0(n-2) 

= 611(77*)+ 0(77-2). ( 26 ) 

Thus, it follows from ( 1251 ) and (f 26 ]l that 

6(63] =63(77*) +0(77-2). 

Similarly, we show that 6(64] = 64(77*) + 0(77-2). □ 
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